Learning to Classify Texts Using Positive and Unlabeled Data

Authors

  • Xiaoli Li
  • Bing Liu
Abstract

In traditional text classification, a classifier is built using labeled training documents of every class. This paper studies a different problem. Given a set P of documents of a particular class (called the positive class) and a set U of unlabeled documents that contains both positive documents and documents of other types (called negative documents), we want to build a classifier that separates the documents in U into positive and negative ones. The key feature of this problem is that there are no labeled negative documents, which makes traditional text classification techniques inapplicable. In this paper, we propose an effective technique to solve the problem. It combines the Rocchio method and the SVM technique for classifier building. Experimental results show that the new method outperforms existing methods significantly.
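The abstract names the Rocchio method as one of the two building blocks but gives no details. As a rough illustration only, the sketch below shows one common way a Rocchio-style step can be used in this setting: treat all of U as provisional negatives, build one prototype vector per class, and keep the documents in U that sit closer to the negative prototype as likely negatives. The function names, the whitespace tokenizer, the plain term-frequency weighting, and the values `alpha=16` and `beta=4` are assumptions made for this example, not details taken from the abstract.

```python
import math
from collections import defaultdict


def tf_vector(text):
    """Return an L2-normalised term-frequency vector as a dict."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, value in vec.items():
            total[term] += value
    return {t: v / len(vectors) for t, v in total.items()}


def prototype(own, other, alpha, beta):
    """Rocchio prototype: alpha * mean(own) - beta * mean(other)."""
    own_mean, other_mean = mean_vector(own), mean_vector(other)
    terms = set(own_mean) | set(other_mean)
    return {t: alpha * own_mean.get(t, 0.0) - beta * other_mean.get(t, 0.0)
            for t in terms}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reliable_negatives(positive_docs, unlabeled_docs, alpha=16.0, beta=4.0):
    """Return the unlabeled documents closer to the negative prototype.

    alpha and beta are illustrative weights (an assumption of this sketch).
    """
    if not positive_docs or not unlabeled_docs:
        raise ValueError("both P and U must be non-empty")
    p_vecs = [tf_vector(d) for d in positive_docs]
    u_vecs = [tf_vector(d) for d in unlabeled_docs]
    c_pos = prototype(p_vecs, u_vecs, alpha, beta)  # U stands in for negatives
    c_neg = prototype(u_vecs, p_vecs, alpha, beta)
    return [doc for doc, vec in zip(unlabeled_docs, u_vecs)
            if cosine(c_neg, vec) > cosine(c_pos, vec)]


# Toy usage with made-up documents.
P = ["svm kernel margin", "kernel svm classifier"]
U = ["svm margin classifier", "football match goal", "football league goal"]
print(reliable_negatives(P, U))
```

On this toy input the two football documents are selected as likely negatives, while the SVM-related document in U is left out. In a full pipeline, a classifier such as an SVM could then be trained on P against the selected documents and applied to the rest of U; that step is not shown here.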


Similar resources

Semi-Supervised Learning Based Prediction of Musculoskeletal Disorder Risk

This study explores a semi-supervised classification approach, using random forest as the base classifier, to classify the risk of low-back disorders (LBDs) associated with industrial jobs. The semi-supervised approach uses unlabeled data together with a small number of labeled examples to create a better classifier. The results obtained by the proposed approach are compared with those o...


Positive-unlabeled convolutional neural networks for particle picking in cryo-electron micrographs

Cryo-electron microscopy (cryoEM) is fast becoming the preferred method for protein structure determination. Particle picking is a significant bottleneck in the solving of protein structures from single particle cryoEM. Hand labeling sufficient numbers of particles can take months of effort and current computationally based approaches are often ineffective. Here, we frame particle picking as a ...


Positive unlabeled learning via wrapper-based adaptive sampling

Learning from positive and unlabeled data frequently occurs in applications where only a subset of positive instances is available while the rest of the data are unlabeled. In such scenarios, often the goal is to create a discriminant model that can accurately classify both positive and negative data by modelling from labeled and unlabeled instances. In this study, we propose an adaptive sampli...


Semi-Supervised Sequence Classification with HMMs

Using unlabeled data to help supervised learning has become an increasingly attractive methodology and has proven effective in many applications. This paper applies semi-supervised classification algorithms, based on hidden Markov models (HMMs), to classify sequences. For model-based classification, semi-supervised learning amounts to using both labeled and unlabeled data to train model parame...


A Semi-Supervised Approach for Gender Identification

In most research studies on Author Profiling, large quantities of correctly labeled data are used to train the models. However, this does not reflect the reality of forensic scenarios: in practical linguistic forensic investigations, the resources available to profile the author of a text are usually scarce. To account for this, we implemented a Semi-Supervised Learning ...



Publication date: 2003